• Show All Code
  • Hide All Code

ST661 (R for Data Analytics)- Group Project

Group I- BHANU NAGA SAI VAMSI BITRA,SILVIU-DANIEL FÁZEKAS,SANTHOSH PRABHAKARAN,ANJU BABY

“Analyzing Trends and Patterns in Olympic Performance Using Athlete and Medal Data”

Introduction

In this project, we analyze two datasets:

Athletes Dataset: Contains detailed information about athletes, including their names, sports, countries, physical attributes (height, weight), and date of birth.

Medals Dataset: Provides details about the medals won, including the type of medal (Gold, Silver, or Bronze), the event, the athlete’s country, and the date the medal was awarded.

These datasets are sourced from reliable Olympic data repositories, ensuring the accuracy and depth of the analysis. The goal is to uncover insights such as:

Patterns in medal distribution across countries.

Trends in age and performance.

The dominance of certain countries or athletes in specific sports.

Relationship between athlete heights and their sports.

Dataset Structure:

Variables such as Full_Name, Sports, Country, Medal_Type, Event, Medal_Date and Height are used for analysis. These variables allow exploration of athlete characteristics, performance trends, and medal distributions.

Dataset information:

Athletes:

# Define file paths for the datasets (correcting file extensions and slashes)
athletes_path <- "C:/Users/USER/OneDrive/Desktop/ST661[A] PROJECT GROUP I/athletes.csv"
medals_path <- "C:/Users/USER/OneDrive/Desktop/ST661[A] PROJECT GROUP I/medals.csv"

# Load datasets
library(readr)

# Load the datasets using read_csv
athletes_df <- read_csv(athletes_path, show_col_types = FALSE)
medals_df <- read_csv(medals_path, show_col_types = FALSE)

# Preview the first few rows of each dataset
head(athletes_df)
## # A tibble: 6 × 36
##     code current name  name_short name_tv gender `function` country_code country
##    <dbl> <lgl>   <chr> <chr>      <chr>   <chr>  <chr>      <chr>        <chr>  
## 1 1.53e6 TRUE    ALEK… ALEKSANYA… Artur … Male   Athlete    ARM          Armenia
## 2 1.53e6 TRUE    AMOY… AMOYAN M   Malkha… Male   Athlete    ARM          Armenia
## 3 1.53e6 TRUE    GALS… GALSTYAN S Slavik… Male   Athlete    ARM          Armenia
## 4 1.53e6 TRUE    HARU… HARUTYUNY… Arsen … Male   Athlete    ARM          Armenia
## 5 1.53e6 TRUE    TEVA… TEVANYAN V Vazgen… Male   Athlete    ARM          Armenia
## 6 1.53e6 TRUE    AREN… ARENAS L   Lorena… Female Athlete    COL          Colomb…
## # ℹ 27 more variables: country_long <chr>, nationality <chr>,
## #   nationality_long <chr>, nationality_code <chr>, height <dbl>, weight <dbl>,
## #   disciplines <chr>, events <chr>, birth_date <date>, birth_place <chr>,
## #   birth_country <chr>, residence_place <chr>, residence_country <chr>,
## #   nickname <chr>, hobbies <chr>, occupation <chr>, education <chr>,
## #   family <chr>, lang <chr>, coach <chr>, reason <chr>, hero <chr>,
## #   influence <chr>, philosophy <chr>, sporting_relatives <chr>, …

Medals:

head(medals_df)
## # A tibble: 6 × 13
##   medal_type   medal_code medal_date name     gender discipline event event_type
##   <chr>             <dbl> <date>     <chr>    <chr>  <chr>      <chr> <chr>     
## 1 Gold Medal            1 2024-07-27 Remco E… M      Cycling R… Men'… ATH       
## 2 Silver Medal          2 2024-07-27 Filippo… M      Cycling R… Men'… ATH       
## 3 Bronze Medal          3 2024-07-27 Wout va… M      Cycling R… Men'… ATH       
## 4 Gold Medal            1 2024-07-27 Grace B… W      Cycling R… Wome… ATH       
## 5 Silver Medal          2 2024-07-27 Anna HE… W      Cycling R… Wome… ATH       
## 6 Bronze Medal          3 2024-07-27 Chloe D… W      Cycling R… Wome… ATH       
## # ℹ 5 more variables: url_event <chr>, code <chr>, country_code <chr>,
## #   country <chr>, country_long <chr>

Data Wrangling

In this project, we handled missing values and cleaned the datasets to prepare them for analysis.

Handling Missing Values:

Critical columns like Full_Name, Country, and Sports were filtered to remove rows with missing or irrecoverable values.

Column Selection and Renaming:

Unnecessary columns from both datasets were removed to focus only on relevant fields. Columns were renamed for consistency and clarity (e.g., name_tv to Full_Name and medal_type to Medal_Type).

Data Cleaning:

Brackets, quotes, and formatting inconsistencies in categorical columns were removed. Columns such as Birth_Date and Medal_Date were converted to proper date formats.

Merging Datasets:

The datasets were merged using common columns (Full_Name, Sports, Country) to form a unified dataset (combined_df). This combined dataset includes all essential variables like Medal_Type, Event, Height, Weight, and Medal_Date.

## Data cleaning

# Remove the specified columns in athletes 

df_clean_ath <- athletes_df %>%
  select(-c(current, name, name_short, country_code, country, 
            country_long, nationality_long, nationality_code, birth_place, 
            birth_country, residence_place, residence_country, nickname, 
            hobbies, occupation, education, family, lang, coach, reason, 
            hero, influence, philosophy, sporting_relatives, ritual, 
            other_sports, events))

# Remove the specified columns in medals

Me_cleaned <- medals_df %>%
  select(-c(medal_code,event_type,url_event,code,country_code,country_long,gender))
#Rename columns in athletes
df_ath <- df_clean_ath %>%
  rename(
    `Full_Name` = name_tv,
    Sports = disciplines,
    Country = nationality,
    Height = height,
    Weight = weight,
    Birth_Date = birth_date,
    Gender = gender,
    Code = code
  )

#Rename columns in medals

df_med <- Me_cleaned %>%
  rename(
    Medal_Type= medal_type,
    Medal_Date= medal_date,
    Full_Name = name,
    Sports= discipline,
    Country = country,
    Event =event
  )
# Athletes Data set

# Remove brackets and quotes from 'sports'

library(stringr)

df_cl <- df_ath %>%
  mutate(
    Sports = str_replace_all(Sports, "\\[|\\]|'", ""), 
  )

# Convert character to date in "dd-mm-yyyy" format
df_cl$Birth_Date <- as.Date(df_cl$Birth_Date, format="%d-%m-%Y")

#head(df_cl$Birth_Date)

#str(df_cl)
## Count NA and empty string values in the entire data frame

#count_na_empty <- apply(df_cl, 2, function(x) sum(is.na(x) | x == ""))

#count_na_empty
# removing rows with missing values for key columns

df_clea <- df_cl %>%
  filter(!is.na(`Full_Name`), !is.na(Country), !is.na(Sports), !is.na(Birth_Date), !is.na(Height), !is.na(Weight))


#count_na_empty <- apply(df_clea, 2, function(x) sum(is.na(x) | x == ""))

#count_na_empty

#Remove duplicate rows, if any


df_cleaned <- df_clea %>% distinct()

#checking data types of dataframe

#str(df_cleaned)

#convert char to date 

df_med$Medal_Date <- as.Date(df_med$Medal_Date, format="%d-%m-%Y")

#head(df_med$Medal_Date)

#str(df_med)
#checking null values 

#count_na_empty <- apply(df_med, 2, function(x) sum(is.na(x) | x == ""))

#count_na_empty

## Merge athletes and medals datasets using common columns
combined_df <- inner_join(df_med, df_cleaned, by = c("Full_Name", "Sports", "Country"))

head(combined_df)
## # A tibble: 6 × 12
##   Medal_Type  Medal_Date Full_Name Sports Event Country   Code Gender `function`
##   <chr>       <date>     <chr>     <chr>  <chr> <chr>    <dbl> <chr>  <chr>     
## 1 Gold Medal  2024-07-27 Remco EV… Cycli… Men'… Belgium 1.90e6 Male   Athlete   
## 2 Bronze Med… 2024-07-27 Wout van… Cycli… Men'… Belgium 1.90e6 Male   Athlete   
## 3 Gold Medal  2024-07-27 Grace BR… Cycli… Wome… Austra… 1.94e6 Female Athlete   
## 4 Silver Med… 2024-07-27 Anna HEN… Cycli… Wome… Great … 1.91e6 Female Athlete   
## 5 Gold Medal  2024-07-27 OH Sanguk Fenci… Men'… Korea   1.93e6 Male   Athlete   
## 6 Silver Med… 2024-07-27 Fares FE… Fenci… Men'… Tunisia 1.94e6 Male   Athlete   
## # ℹ 3 more variables: Height <dbl>, Weight <dbl>, Birth_Date <date>
#write.csv(combined_df, "combined_Clen_dataset.csv", row.names = FALSE)
#Transform the Data
#combined_df <- read.csv("E:/New folder (4)/olympics/combined_Clen_dataset1.csv")
# Assuming medals_df has a column 'medal_date' with values like '2024-07-27'
medals_df <- combined_df %>%
  mutate(year = year(as.Date(Medal_Date)))
#write.csv(combined_df, "combined_Clen_dataset1.csv", row.names = FALSE)

Final Output:

The cleaned and merged dataset has no missing values and is ready for analysis. Numerical and categorical data were validated for correctness, and duplicates were removed. This cleaned dataset provides a reliable foundation for uncovering trends and patterns in Olympic performance.

Data Analysis

Story 1

Mean Age of Medal-Winning Athletes by Sport and Gender

The goal of this plot is to analyze the average age of medal-winning athletes across sports, highlighting differences between male and female athletes. It identifies sport-specific age trends and gender-based variations in peak performance. This insight helps understand how age impacts success in various disciplines.

# Calculate the mean age and round to the nearest whole number
mean_age_df <- combined_df %>%
  mutate(Age = as.numeric(difftime(as.Date(Medal_Date), Birth_Date, units = "days")) / 365.25) %>%
  group_by(Sports, Gender) %>%
  summarise(
    Mean_Age = round(mean(Age, na.rm = TRUE)),  # Round the mean age
    .groups = "drop"
  )

# Create a custom color palette for gender
gender_colors <- c("Male" = "blue", "Female" = "red")

# Create the line plot using plotly
plot <- plot_ly(mean_age_df, 
                x = ~Sports, 
                y = ~Mean_Age, 
                color = ~Gender, 
                colors = gender_colors,  # Apply custom colors
                type = 'scatter', 
                mode = 'lines+markers') %>%
  layout(
    title = "Mean Age of Medal-Winning Athletes by Sport and Gender",
    xaxis = list(title = "Sports", tickangle = 45),  # Rotate x-axis labels
    yaxis = list(title = "Mean Age"),
    legend = list(title = list(text = "Gender"))
  )

# Print the plot
plot

Story 2

Athletes Who Won Multiple Medals In A Particular Sport

The goal of this plot is to showcase athletes who have won multiple medals in a particular sport, emphasizing their exceptional performance and consistency. It highlights individual achievements and identifies athletes who dominate their respective disciplines. This insight helps recognize top performers and trends in medal-winning excellence.

# Group by athlete to find athletes who won multiple medals
multi_event_medalists <- combined_df %>%
  group_by(Full_Name, Sports, Country) %>%
  summarise(medals_won = n_distinct(Event), .groups = "drop") %>%
  filter(medals_won > 1)  # Filter only athletes who won more than 1 medal



# Create Plotly scatter plot for multi-event medalists with both athlete's name, sport and country
multi_event_medalists <- multi_event_medalists %>%
  arrange(desc(medals_won))  # Ensure medals_won is ordered

# Create the plot
plot <- plot_ly(
  data = multi_event_medalists,
  x = ~medals_won,  # Number of medals on the x-axis
  y = ~reorder(Full_Name, medals_won),  # Reorder athletes by medals won
  type = 'scatter',
  mode = 'markers',  # Scatter plot with markers
  marker = list(color = 'purple', size = 10, opacity = 0.7),  # Marker customization
  text = ~paste(
    "Athlete: ", Full_Name,
    "<br>Sport: ", Sports,
    "<br>Country: ", Country,
    "<br>Medals Won: ", medals_won
  ),  # Tooltip text
  hoverinfo = "text"  # Display tooltip text
) %>%
  layout(
    title = 'Athletes Who Won Multiple Medals In A Particular Sport',
    xaxis = list(title = 'Number of Medals Won'),
    yaxis = list(title = 'Athlete', categoryorder = "total ascending"),  # Sort y-axis
    template = 'plotly_white'
  )

# Display the plot
plot

Story 3

Top 10 Countries by Medal Count

The goal of this plot is to highlight the top 10 countries by total medal count, breaking it down by gold, silver, and bronze medals. It showcases which nations excel in international sports competitions and their relative strengths in achieving higher-tier medals. This insight reflects the countries’ dominance and investment in athletics.

# Summarize medal distribution by country and medal type
medal_distribution <- combined_df %>%
  group_by(Country, Medal_Type) %>%
  summarise(medal_count = n(), .groups = "drop")

# Summarize the total medal count per country
total_medals_by_country <- medal_distribution %>%
  group_by(Country) %>%
  summarise(total_medals = sum(medal_count), .groups = "drop") %>%
  arrange(desc(total_medals)) %>%
  slice_head(n = 10)  # Get top 10 countries

# Filter for the top 10 countries in the original medal distribution
top_10_medal_distribution <- medal_distribution %>%
  filter(Country %in% total_medals_by_country$Country)



# Set custom colors for the medals
medal_colors <- c("Gold Medal" = "gold", "Silver Medal" = "gray", "Bronze Medal" = "darkorange3")

# Ensure the Medal_Type column has the correct order
top_10_medal_distribution$Medal_Type <- factor(
  top_10_medal_distribution$Medal_Type,
  levels = c("Gold Medal", "Silver Medal", "Bronze Medal")
)

# Add tooltips for interactivity
top_10_medal_distribution <- top_10_medal_distribution %>%
  mutate(tooltip = paste0(
    "Country: ", Country, "<br>",
    "Medal Type: ", Medal_Type, "<br>",
    "Medal Count: ", medal_count
  ))
interactive_plot <- ggplot(top_10_medal_distribution, aes(
  x = reorder(Country, -medal_count),
  y = medal_count,
  fill = Medal_Type
)) +
  geom_bar_interactive(
    stat = "identity",
    position = "stack",
    aes(tooltip = tooltip, data_id = interaction(Country, Medal_Type))
  ) +
  theme_minimal() +
  labs(
    title = "Top 10 Countries by Medal Count",
    x = "Country",
    y = "Medal Count",
    fill = "Medal Type"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 20) # Increase the font size of the title
  ) +
  scale_fill_manual(values = medal_colors)

# Render the interactive plot using 'girafe'
girafe(ggobj = interactive_plot, width_svg = 12, height_svg = 8)

Story 4

Analyzing the Relationship Between Athlete Heights and Their Sports

The goal of this plot is to compare the height distributions of male and female athletes across different sports. It helps identify gender-based height differences and variations in height trends within each sport. This can provide insights into height’s role in athletic performance and suitability for specific sports.

# Load required libraries
library(ggplot2)
library(dplyr)

# Filter out rows where height is 0
filtered_data <- combined_df %>%
  filter(Height > 0)

# Create a tooltip column for interactivity (optional for advanced features)
filtered_data <- filtered_data %>%
  mutate(tooltip = paste("Athlete:", Full_Name, 
                         "<br>Gender:", Gender,
                         "<br>Sport:", Sports,
                         "<br>Height:", Height, "cm"))

# Create the boxplot
plot <- ggplot(filtered_data, aes(x = Sports, y = Height, fill = Gender)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +  # Standard boxplot
  labs(
    title = "Boxplot of Height vs Sports Grouped by Gender",
    x = "Sport",
    y = "Height (cm)",
    fill = "Gender"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)  # Rotate x-axis labels for readability
  )

# Display the plot
print(plot)

Conclusion

The analysis of Olympic datasets has provided significant insights into the performance trends of athletes and countries. Here are the key takeaways:

1) Mean Age of Medal-Winning Athletes by Sport and Gender

The analysis revealed significant variations in the average age of medal-winning athletes across different sports and genders. Certain sports, such as gymnastics for males(22) and skateboarding for females (16) tend to favor younger athletes, while others, like equestrian events for both males (40) and females (39), see more experienced participants excelling. The interactive visualization showcased these patterns, providing insights into age-specific competitiveness and the physical and mental readiness required for various disciplines.

2) Athletes Who Won Multiple Medals in a Particular Sport

This analysis identified athletes who secured multiple medals in a single sport, highlighting their exceptional performance and contribution to their country’s success. The visualization emphasized individual dominance, allowing users to explore the achievements of these athletes. Athletes such as Summer McINTOSH from Canada and Leon MARCHAND from France both won four medals each highlighting exceptional achievement.

3) Top 10 Countries by Medal Count:

The analysis of the top 10 countries with the highest medal counts showcased their dominance in Olympic sports. Countries such as USA and China consistently performed well, excelling across various disciplines. Japan performed well in comparison to the gold medal ratio it has to the other countries. The stacked bar chart provided a clear breakdown of medal types (Gold, Silver, Bronze) for each country, offering a comprehensive understanding of their achievements.

4) Height Correlation by Sport

The analysis of athletes’ heights across various sports revealed notable trends. Sports like athletics and badminton were dominated by taller athletes, as their height offers a competitive edge in terms of reach and performance. On the other hand, sports such as sport climbing saw a higher prevalence of shorter athletes, where agility, flexibility, and a lower center of gravity are crucial for success.

Additionally, gender-based differences in height were observed, with male athletes generally displaying greater height variability compared to female athletes. The box-plot visualization effectively showcased these patterns, emphasizing how physical demands shape the height distribution across sports and genders.

Authors’ statements :

“I ,BHANU NAGA SAI VAMSI BITRA had primary responsibility for the material in the DATA WRANGLING section, involving the conceptualization of insights and their visualization.”

“I,SILVIU-DANIEL FAZEKAS, had primary responsibility for the material in ANALYTICS section.”

“I, SANTHOSH PRABHAKARAN, had primary responsibility for the material in the ANALYTICS section.”

“I, ANJU BABY, had primary responsibility for the material in the INTRODUCTION and CONCLUSION section.”